NVTrans‐UNet: Neighborhood vision transformer based U‐Net for multi‐modal cardiac MR image segmentation

Abstract With the rapid development of artificial intelligence and image processing technology, medical imaging has become a critical tool for clinical diagnosis and disease treatment. The extraction and segmentation of regions of interest in cardiac images are crucial to the diagnosis of cardiovascular diseases. Because the heart contracts and relaxes irregularly, the boundaries in Magnetic Resonance (MR) images are quite fuzzy. Moreover, it is hard to provide complete information using a single modality due to the complex structure of the cardiac image. Furthermore, conventional CNN‐based segmentation methods are weak in feature extraction. To overcome these challenges, we propose a multi‐modal method for cardiac image segmentation, called NVTrans‐UNet. Firstly, we employ the Neighborhood Vision Transformer (NVT) module, which takes advantage of Neighborhood Attention (NA) and inductive biases. It can better extract the local information of the cardiac image while reducing the computational cost. Secondly, we introduce a Multi‐modal Gated Fusion (MGF) network, which automatically adjusts the contributions of the feature maps of different modalities and makes full use of multi‐modal information. Thirdly, a bottleneck layer with Atrous Spatial Pyramid Pooling (ASPP) is proposed to expand the feature receptive field. Finally, a mixed loss is used to focus on the fuzzy boundary and achieve accurate segmentation. We evaluated our model on the MyoPS 2020 dataset. The Dice score of myocardial infarction (MI) was 0.642 ± 0.171, and the Dice score of myocardial infarction + edema (MI + ME) was 0.574 ± 0.110. Compared with the baseline, the Dice score of MI increases by 11.2%, and that of MI + ME increases by 12.5%. The results show the effectiveness of the proposed NVTrans‐UNet in the segmentation of MI and ME.

become the main disease risk. 3 In clinical practice, accurate segmentation of cardiac substructures and diseased tissue is an essential premise for the diagnosis and prevention of cardiovascular disease (CVD) and for assisting doctors in treatment. 4,5 Cardiac images are generally segmented manually by doctors or experts based on existing medical knowledge, clinical experience, and medical conditions, but this method is time-consuming, energy-consuming, and highly subjective. 6 Manual segmentation can also become inaccurate when experts work for long periods under fatigue. Therefore, automated segmentation is crucial for the precise diagnosis and prompt treatment of cardiovascular diseases.
Although many methods have been applied to cardiac segmentation, the heart is a dynamic organ whose shape changes during the beating process, which makes the images prone to noise and artifacts and increases the difficulty of localization and segmentation. In cardiac image segmentation, the shape and working mode of each region are different, which makes it difficult for segmentation algorithms to achieve the desired effect on pathological regions. In addition, due to the difficulty of extracting complex features of the cardiac structure, the fuzzy boundaries of pathological regions in the cardiac image, and the limitations of unimodal information, the segmentation task remains very challenging. In summary, our primary aim is to achieve accurate and efficient segmentation of pathological regions (MI and ME) in cardiac images.
Over the past few years, several CMR image segmentation techniques have been proposed. These recent methods can be divided into unimodal methods and multi-modal methods.
Firstly, some advanced methods have recently been developed for unimodal segmentation of cardiac images. For example, in 2021, Bi et al. 7 proposed a model to extract the left ventricle and discussed a sequential shape similarity (SSS), which is based on the Active Contour Model. The model can accurately outline the boundary of the left ventricle with the snake contour algorithm under the constraint of SSS. Ammar et al. 8 evaluated a U-Net-based variant, a 2D convolutional long short-term memory (LSTM) recurrent neural network, by capturing the potential relevance. Cui et al. 9 developed an attention U-Net architecture, which automatically pays more attention to regions of interest (ROI) with distinct sizes and shapes, highlighting the required parts while suppressing irrelevant regions. It effectively addresses the high imbalance between the ROI and background regions, and the accuracy of cardiac image segmentation is effectively improved. Nevertheless, in clinical practice, the limitations of a single modality obstruct the practical use of computer-assisted diagnostics.
Due to the limitations of unimodal methods, multi-modal segmentation has attracted more and more attention. Doctors can more accurately localize and diagnose lesion regions with the help of multi-modal information. Therefore, a number of multi-modal segmentation methods have been introduced recently, and the segmentation results show the superiority of multi-modal methods over unimodal ones. For example, in 2019, Zhou et al. 10 presented a review of multi-modal segmentation methods and analyzed the fusion strategies of different network structures. In 2020, Zhang et al. 11 introduced a multi-modal cardiac pathology segmentation architecture using a fusion algorithm. The architecture is mainly composed of two neural networks: an anatomical structure segmentation network (ASSN) and a pathological regions segmentation network (PRSN). Liao et al. 12 designed a multi-modal transfer learning network based on adversarial training for 3D cardiac segmentation. The spatial attention mechanism is introduced to optimize feature extraction and remove redundant information. To address the difficult segmentation caused by the diversity of lesion regions, in 2022, Li et al. 13 evaluated a Siamese U-Net, which first explored the correlation among modalities and then extracted ROI features to improve the fusion of information. In summary, there are several applications of multi-modal images in the diagnosis of cardiac diseases. Because MRI images have uneven grayscale and different imaging methods show different tumor substructures, it is necessary to design multi-modal segmentation tasks that use the characteristics of multi-modal MRI images.
The misalignment of modalities due to various scanning directions and the low tissue contrast of particular modalities make multi-modal segmentation very challenging. 14 This paper mainly studies multi-modal cardiac segmentation, focusing on the automatic segmentation of MI and MI + ME. Herein, we design NVTrans-UNet for multi-modal cardiac image segmentation. Three parallel encoders are used to extract features from the three modalities, 15 to adapt to the difference in the pixel intensity distribution of each modality. Our major contributions are summarized as follows:
1. In the encoding phase, we leverage an efficient hierarchical neighborhood Transformer, namely NVT. NVT utilizes overlapping small convolution kernels for feature embedding and downsampling, which pays more attention to local information and has low complexity.
2. We introduce the MGF network to each Transformer layer of the encoder. The network aggregates the feature maps of task-related information in the three modalities and automatically learns to adjust the contribution of the three modal feature maps.
3. A bottleneck layer with ASPP is added between the encoder and the decoder, which can accurately capture information at different scales and increase the capacity to express detailed features, as well as the ability to recognize and segment small objects.
4. The mixed loss is introduced to optimize the network, allowing the model to focus on the boundaries of myocardial pathology while simultaneously resolving the issue of class imbalance.
The rest of this article is organized as follows. Section 2 introduces the network architecture for cardiac segmentation and the related work behind the different modules of the overall framework. Section 3 describes the dataset, experimental configuration, and evaluation metrics. Section 4 presents the experimental results and ablation experiments. Section 5 provides the discussion and outlook, and Section 6 concludes the paper.

METHOD
This section presents the NVTrans-UNet network and describes in detail its overall design, the Neighborhood Vision Transformer (NVT), the multi-modal gated fusion module (MGF), Atrous Spatial Pyramid Pooling (ASPP), and the loss function.

Network architecture
The NVTrans-UNet architecture is mainly made up of the following modules: an encoder module, a bottleneck layer, and a decoder module. In general, we introduce NVT and MGF to fully utilize the local and multi-modal information, respectively. ASPP is added to the bottleneck layer to expand the receptive field, decrease the number of parameters, and enhance the capacity to extract detailed features. Figure 1 shows the overall architecture of our model.

Neighborhood vision transformer
Because self-attention (SA) in the Transformer architecture scales poorly with sequence length, it requires large memory and has high time complexity for high-resolution images. Therefore, we leverage NVT, which is built on NA, a simple and flexible attention mechanism that localizes the receptive field of each token to its nearest neighboring pixels. The multi-headed neighborhood attention block is shown in Figure 2. NA is a localization of SA whose complexity is linear in both the resolution and the neighborhood size. Compared to SA, NA not only reduces the computational cost but also introduces local inductive biases similar to those of the convolution operation.
The model starts with a convolutional downsampler and then consists of three sequential layers, each made up of multiple NVT blocks. The first layer of NVT downsamples the input feature using two consecutive 3 × 3 convolutions with 2 × 2 strides, making the spatial size 1/4 of the input size. Each NVT block consists of multi-headed neighborhood attention (NA), a multilayered perceptron (MLP), layer norm (LN), and skip connections. NA for a pixel at (i, j) is computed as follows:

$$\mathrm{NA}(X_{i,j}) = \mathrm{softmax}\left(\frac{Q_{i,j} K_{\rho(i,j)}^{T} + B_{i,j}}{\mathrm{scale}}\right) V_{\rho(i,j)}$$

where $\rho(i, j)$ denotes the neighborhood of a pixel at (i, j), $\|\rho(i, j)\| = S^2$ (S is the neighborhood size), Q, K, and V are linear projections of the input X, $B_{i,j}$ is the relative position bias, and scale is the scaling factor.
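The neighborhood attention step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it uses a single head, omits the relative position bias B, scales by the square root of the channel dimension, and handles borders by shifting the window inward so that every pixel still attends to S × S pixels. The function and argument names are hypothetical.

```python
import numpy as np

def neighborhood_attention(x, wq, wk, wv, size=3):
    """Single-head neighborhood attention on an (H, W, C) feature map.

    Assumes H >= size and W >= size. The relative position bias is omitted.
    """
    h, w, _ = x.shape
    q, k, v = x @ wq, x @ wk, x @ wv
    scale = np.sqrt(q.shape[-1])
    r = size // 2
    out = np.empty_like(v)
    for i in range(h):
        # Clamp the window start so the neighborhood stays inside the image.
        i0 = min(max(i - r, 0), h - size)
        for j in range(w):
            j0 = min(max(j - r, 0), w - size)
            keys = k[i0:i0 + size, j0:j0 + size].reshape(-1, k.shape[-1])
            vals = v[i0:i0 + size, j0:j0 + size].reshape(-1, v.shape[-1])
            logits = keys @ q[i, j] / scale
            weights = np.exp(logits - logits.max())  # numerically stable softmax
            weights /= weights.sum()
            out[i, j] = weights @ vals
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))
wq, wk, wv = (rng.normal(size=(4, 4)) for _ in range(3))
print(neighborhood_attention(x, wq, wk, wv, size=3).shape)  # (8, 8, 4)
```

When the neighborhood covers the whole feature map, this computation reduces to ordinary self-attention, which is a convenient sanity check for such a sketch.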

Multi-modal gated fusion
Because different modalities reflect diverse substructures of the heart, their contribution weights in various cardiac substructures differ. However, existing classical fusion strategies cannot dynamically balance the contributions of different modalities. 16 Consequently, we introduce a multi-modal gated fusion (MGF) strategy, which automatically learns to adjust the contribution of the feature maps from each modality. To this end, we dynamically learn a weight map that controls the proportion of information from each modality, and then fuse the features of all modalities. To realize the multi-scale fusion of multi-modal information, we apply the MGF module in each NVT layer of the encoder. The MGF is superior to conventional fusion strategies because it properly aggregates the complementary information with correlation weights. Figure 4 depicts the MGF module structure. Specifically, the MGF concatenates the features from each downsampling layer and inputs them to the upsampling layers with three output channels. Then the weight matrix G is generated by the sigmoid activation function. This matrix can be divided into three independent maps {W_1, W_2, W_3}, which are multiplied by the feature maps of the three modalities, and the outputs are then concatenated. Each weight is a trainable parameter. During the training phase, the optimizer continuously updates the parameters to minimize the loss function. Eventually, we obtain the fusion result F. The output has the same feature map size and channel number as the input.

FIGURE 1 The overview architecture of the NVTrans-UNet segmentation model.

FIGURE 2 Multi-headed neighborhood attention block.
Mathematically, the MGF is expressed using the following notation: $\sigma$ is the sigmoid function, $\sigma(x) = 1/(1 + e^{-x})$; ⊕ represents concatenation; $C_i$ and $C_j$ are the convolution kernel sizes; ⊗ denotes the element-wise product; $F''(i)$ refers to the ith feature map of F; and $b_i$ and $b_j$ are the biases of the convolution layers.
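The gating procedure described in this section can be sketched in NumPy as follows. This is an illustration under assumptions, not the authors' code: the convolutions are taken to be 1 × 1 (a per-pixel matrix product), a final 1 × 1 projection is assumed in order to return to the input channel number, and all function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(feats, w_gate, b_gate, w_out, b_out):
    """Fuse three (H, W, C) modality feature maps with learned per-pixel gates.

    w_gate: (3C, 3), b_gate: (3,)  -> one gate map per modality (assumed 1x1 conv)
    w_out:  (3C, C), b_out:  (C,)  -> assumed projection back to C channels
    """
    stacked = np.concatenate(feats, axis=-1)        # (H, W, 3C)
    gates = sigmoid(stacked @ w_gate + b_gate)      # (H, W, 3), values in (0, 1)
    # Multiply each modality by its own gate map, then concatenate again.
    weighted = [f * gates[..., m:m + 1] for m, f in enumerate(feats)]
    fused = np.concatenate(weighted, axis=-1)       # (H, W, 3C)
    return fused @ w_out + b_out                    # (H, W, C)

rng = np.random.default_rng(0)
c = 8
feats = [rng.normal(size=(4, 4, c)) for _ in range(3)]
fused = gated_fusion(
    feats,
    rng.normal(size=(3 * c, 3)), np.zeros(3),
    rng.normal(size=(3 * c, c)), np.zeros(c),
)
print(fused.shape)  # (4, 4, 8)
```

Because the gates are computed per pixel, a modality can dominate in one region of the image and be suppressed in another, which is the dynamic balancing the text describes.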

Atrous spatial pyramid pooling
The bottleneck layer provides a connection between the encoder and decoder, offering multi-scale information for semantic segmentation tasks. Therefore, we add ASPP to the bottleneck layer to increase the receptive field and robustly segment object regions at different levels. 17 The structure of ASPP is shown in Figure 5.
In order to extract contextual information from the multi-scale features and recover spatial information from diverse fields of view, ASPP applies filters with multiple sampling rates and effective fields of view to the incoming convolutional feature layers. Four parallel atrous convolutions (r = {1, 6, 12, 18}) are used to combine features from distinct receptive fields. The features extracted from each sampling layer are further processed in separate branches and concatenated to produce the final result. As well as increasing the feature receptive field, ASPP can also improve the recognition and detection of small objects. The output size of each atrous convolution in ASPP can be stated mathematically as:

$$O = \left\lfloor \frac{i + 2p - \left[k + (k - 1)(d - 1)\right]}{s} \right\rfloor + 1$$

where i represents the input size of the atrous convolution, p denotes the size of the padding, and s indicates the stride size. k is the size of the original convolution kernel, d stands for the dilation rate, and k + (k − 1)(d − 1) is the convolution kernel size after the dilation rate is inserted. O refers to the size of the feature map after atrous convolution.
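As a small worked example of the quantities defined above, the following Python sketch computes the dilated kernel size k + (k − 1)(d − 1) and the resulting output size for the four dilation rates. The 3 × 3 kernel, the padding equal to the dilation rate, and the 18 × 18 input are assumptions chosen for illustration; the function name is hypothetical.

```python
def atrous_output_size(i, k, d, p=0, s=1):
    """Output size of an atrous convolution along one spatial dimension."""
    k_eff = k + (k - 1) * (d - 1)  # kernel size after inserting the dilation rate
    return (i + 2 * p - k_eff) // s + 1

for rate in (1, 6, 12, 18):
    # With k = 3 and padding equal to the rate, every branch keeps the input size,
    # so the branch outputs can be concatenated channel-wise.
    print(rate, atrous_output_size(i=18, k=3, d=rate, p=rate))
```

With these assumed settings, the effective kernel grows from 3 to 37 pixels across the four branches while the spatial size stays unchanged.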

Loss function
Since the pathological regions in the cardiac image are relatively small, the segmentation of the cardiac image is unbalanced. In unbalanced segmentation, there may be large differences among different regions, and the boundary of the pathological regions of the cardiac image is fuzzy, which affects training. 18 We use a multi-scale structural similarity index measure (MS-SSIM) 19 loss function that applies larger weights to the boundary in order to further refine the boundary of pathological regions. A larger MS-SSIM loss value indicates a larger regional distribution difference. The segmentation result p and the ground truth label t are cropped into two identical N × N patches that have been aligned with each other. Let $p = \{p_i \mid i = 1, 2, \ldots, N^2\}$ and $t = \{t_i \mid i = 1, 2, \ldots, N^2\}$; the luminance, contrast, and structure comparison measures and the MS-SSIM loss function are given as follows:

$$l(p, t) = \frac{2\mu_p \mu_t + C_1}{\mu_p^2 + \mu_t^2 + C_1}$$

$$c(p, t) = \frac{2\sigma_p \sigma_t + C_2}{\sigma_p^2 + \sigma_t^2 + C_2}$$

$$s(p, t) = \frac{\sigma_{pt} + C_3}{\sigma_p \sigma_t + C_3}$$

$$\mathrm{MS\text{-}SSIM}(p, t) = \left[l_M(p, t)\right]^{\alpha_M} \prod_{m=1}^{M} \left[c_m(p, t)\right]^{\beta_m} \left[s_m(p, t)\right]^{\gamma_m}$$

$$\mathcal{L}_{\text{MS-SSIM}} = 1 - \mathrm{MS\text{-}SSIM}(p, t)$$

where $\mu_p$, $\mu_t$ are the means and $\sigma_p$, $\sigma_t$ are the standard deviations of the predicted image and the ground truth, respectively. $\sigma_{pt}$ represents the covariance between the predicted image and the ground truth. $C_1$, $C_2$, and $C_3$ are constants. M denotes the total number of scales. $\alpha_M$, $\beta_m$, and $\gamma_m$ are employed to modify the relative weights of the three elements.
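To make the luminance, contrast, and structure measures concrete, here is a minimal single-scale sketch in plain Python. It is a simplification rather than the loss used in the paper: it evaluates one scale only (MS-SSIM combines such terms over M scales), uses unit exponents, and assumes the commonly used constants C1 = 0.01², C2 = 0.03², and C3 = C2 / 2 for intensities in [0, 1]. The function names are hypothetical.

```python
import math

def ssim_terms(p, t, c1=0.01 ** 2, c2=0.03 ** 2):
    """Luminance, contrast, and structure terms for two flattened patches."""
    n = len(p)
    c3 = c2 / 2
    mu_p, mu_t = sum(p) / n, sum(t) / n
    var_p = sum((a - mu_p) ** 2 for a in p) / n
    var_t = sum((b - mu_t) ** 2 for b in t) / n
    cov = sum((a - mu_p) * (b - mu_t) for a, b in zip(p, t)) / n
    sd_p, sd_t = math.sqrt(var_p), math.sqrt(var_t)
    luminance = (2 * mu_p * mu_t + c1) / (mu_p ** 2 + mu_t ** 2 + c1)
    contrast = (2 * sd_p * sd_t + c2) / (var_p + var_t + c2)
    structure = (cov + c3) / (sd_p * sd_t + c3)
    return luminance, contrast, structure

def ssim_loss(p, t):
    """One minus the single-scale SSIM of the two patches."""
    luminance, contrast, structure = ssim_terms(p, t)
    return 1.0 - luminance * contrast * structure

patch = [0.1, 0.5, 0.9, 0.3]
print(ssim_loss(patch, patch))                 # close to 0 for identical patches
print(ssim_loss([0, 0, 1, 1], [1, 1, 0, 0]))   # large for an inverted patch
```

The loss is near zero when prediction and label agree within the patch and grows when their local structure differs, which is why patches that straddle a misplaced boundary are penalized more heavily.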
By combining MS-SSIM loss, 19 Tversky loss, 20 and Focal loss, 21 we design a mixed loss function for segmentation in patch level and pixel level hierarchies, which can capture fine structures with clear boundaries.
To reduce the weight of the large volume of negative samples in training, the Focal loss function is primarily employed to address the class imbalance between positive and negative samples. 22 The Focal loss formula is as follows:

$$\mathcal{L}_{\text{Focal}} = -\,Y_t\,(1 - Y_p)^{\gamma} \log(Y_p)$$

where $Y_p$ represents the predicted segmentation result and $Y_t$ represents the ground truth. $(1 - Y_p)^{\gamma}$ represents the modulating factor, and $\gamma$ indicates the focusing parameter. The impact of the modulating factor grows as $\gamma$ increases. The purpose of adding this modulating factor is to reduce the weight of easy-to-classify pixels so that the model can focus more on hard-to-classify pixels during training.
In the case of sample imbalance, misclassification can lead to a large increase in loss, leading to unstable optimization. Abraham et al. 23 studied the similarity index based on Tversky, which introduces two parameters ($\alpha$ and $\beta$) to control false positives (FP) and false negatives (FN); the balance between FP and FN can be adjusted by $\alpha$ and $\beta$. The Tversky loss formula is:

$$\mathcal{L}_{\text{Tversky}} = 1 - \frac{|Y_t \cap Y_p|}{|Y_t \cap Y_p| + \alpha\,|Y_p \setminus Y_t| + \beta\,|Y_t \setminus Y_p|}$$

where $Y_t$ and $Y_p$ refer to the real label and the prediction result, respectively, and the sum of $\alpha$ and $\beta$ is 1. Consequently, the mixed loss function is a weighted combination of the MS-SSIM, Tversky, and Focal losses, in which trade-off parameters weight the impact of each term; the values 5 and 3 are set empirically for these parameters in our experiments.
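The Focal and Tversky terms can be sketched in plain Python for a flattened binary mask. This is an illustrative version under assumptions, not the authors' implementation: the focal term is written in its symmetric binary form, and the defaults for the focusing parameter (2.0) and for the FP and FN weights (0.3 and 0.7, which sum to 1) are hypothetical choices, as are the function names.

```python
import math

def focal_loss(y_true, y_pred, gamma=2.0, eps=1e-7):
    """Mean binary focal loss over flattened pixels (labels are 0 or 1)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        p_correct = p if t == 1 else 1.0 - p  # probability given to the true class
        # The modulating factor shrinks the loss of well-classified pixels.
        total += -((1.0 - p_correct) ** gamma) * math.log(p_correct)
    return total / len(y_true)

def tversky_loss(y_true, y_pred, alpha=0.3, beta=0.7, eps=1e-7):
    """One minus the Tversky index; alpha weights FP and beta weights FN."""
    tp = sum(t * p for t, p in zip(y_true, y_pred))
    fp = sum((1 - t) * p for t, p in zip(y_true, y_pred))
    fn = sum(t * (1 - p) for t, p in zip(y_true, y_pred))
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

labels = [1, 0, 0, 0, 1, 0]
probs = [0.8, 0.1, 0.3, 0.05, 0.4, 0.2]
print(focal_loss(labels, probs), tversky_loss(labels, probs))
```

Setting beta larger than alpha penalizes missed lesion pixels more than false alarms, which suits small pathological regions; with gamma set to 0 the focal term reduces to ordinary cross-entropy.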

Dataset and preprocessing
We test our model using the MyoPS 2020 dataset for myocardial pathology segmentation, which contains three modalities, including Late Gadolinium enhancement (LGE) CMR, T2-weighted CMR, and balanced steady-state free precession (bSSFP) CMR.
LGE CMR is bright in the region of MI and can be used to identify regions of inflammation, cardiomyopathy, and infarct, 24 but its anatomical and edema boundaries are fuzzy. The bSSFP CMR can capture cardiac motion and the boundary between the myocardium and blood cavity. 25 T2-weighted CMR helps distinguish acute myocardial infarction from distal myocardial infarction. MI and ME can be segmented simultaneously by combining the three modalities.
MyoPS 2020 contains 25 labeled (102 slices) multi-sequence CMR images and 20 unlabeled (72 slices) images. We first slice the cardiac volumes into 2D images. Considering that the ROI occupies only a small part of the whole image, we then crop the center to 288 × 288 pixels. Given the small number of samples, we implement a data augmentation strategy based on random warping and rotation. Firstly, the random warping is achieved by generating an 8 × 8 × 2 uniformly distributed random matrix. Then, we resize the nonrigid warping matrix to 288 × 288 × 2 and use bilinear interpolation to process the warping map. After using random warping to augment the data, we apply a random rotation of 90°, 180°, or 270°, selected with equal probability. Figure 6 shows the input multi-modal cardiac image slice and Figure 7 shows the schematic diagram of the preprocessing approach. For all cardiac images, there are two pathological masks (myocardial infarction and edema) and three anatomy masks (myocardium, left ventricle, and right ventricle).
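The cropping and augmentation steps can be sketched with NumPy as below. This is a minimal illustration under assumptions: the displacement range `max_shift` (in pixels) is not given in the text and is a hypothetical parameter, the image is a single 2D slice, and all function names are invented for the example. In practice the same displacement field and rotation would presumably be shared by the three modalities and their masks.

```python
import numpy as np

def bilinear_sample(img, rows, cols):
    """Sample a 2D array at fractional (rows, cols) with bilinear interpolation."""
    h, w = img.shape
    rows = np.clip(rows, 0, h - 1)
    cols = np.clip(cols, 0, w - 1)
    r0, c0 = np.floor(rows).astype(int), np.floor(cols).astype(int)
    r1, c1 = np.minimum(r0 + 1, h - 1), np.minimum(c0 + 1, w - 1)
    fr, fc = rows - r0, cols - c0
    top = img[r0, c0] * (1 - fc) + img[r0, c1] * fc
    bottom = img[r1, c0] * (1 - fc) + img[r1, c1] * fc
    return top * (1 - fr) + bottom * fr

def center_crop(img, size=288):
    h, w = img.shape
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def random_warp_rotate(img, rng, grid=8, max_shift=4.0):
    """Warp a square image with a smooth random field, then rotate by 90/180/270 degrees."""
    size = img.shape[0]
    # Coarse grid x grid x 2 uniform random displacements (row and column shifts).
    coarse = rng.uniform(-max_shift, max_shift, size=(grid, grid, 2))
    # Resize the coarse field to size x size x 2 with bilinear interpolation.
    pos = np.linspace(0, grid - 1, size)
    rr, cc = np.meshgrid(pos, pos, indexing="ij")
    field = np.stack([bilinear_sample(coarse[..., a], rr, cc) for a in range(2)], axis=-1)
    base_r, base_c = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    warped = bilinear_sample(img, base_r + field[..., 0], base_c + field[..., 1])
    quarter_turns = int(rng.integers(1, 4))  # 1, 2, or 3 with equal probability
    return np.rot90(warped, quarter_turns)

rng = np.random.default_rng(0)
slice_2d = rng.random((320, 300))
augmented = random_warp_rotate(center_crop(slice_2d), rng)
print(augmented.shape)  # (288, 288)
```

Upsampling a coarse 8 × 8 field keeps the deformation smooth, so anatomical shapes are distorted gently rather than scrambled pixel by pixel.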

Experimental settings
The NVTrans-UNet is implemented in Python based on TensorFlow (version 1.8.0). The hardware configuration uses a GeForce RTX 2080 Ti GPU with 11 GB of memory. On the MyoPS 2020 public dataset, the training set and

FIGURE 6 The input multi-modal cardiac image slice.

FIGURE 7 Schematic diagram of center region preprocessing.

Evaluation metric
The evaluation metrics used in this experiment are the Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD), which are often used to evaluate the quality of medical image segmentation. DSC measures the similarity between the segmentation results and ground truth. The specific formulation of DSC is as follows:

$$\mathrm{DSC}(T, P) = \frac{2\,|T \cap P|}{|T| + |P|} \tag{14}$$

In Equation (14), T is the label, and P is the segmentation result of the evaluated method. The Dice score ranges between 0 and 1.
HD is a measure that describes the degree of similarity between two point sets. The HD can be formulated as:

$$\mathrm{HD}(T, P) = \max\left\{ \max_{S_T \in T} \min_{S_P \in P} d(S_T, S_P),\; \max_{S_P \in P} \min_{S_T \in T} d(S_P, S_T) \right\} \tag{15}$$

In Equation (15), T denotes the labels and P represents the predicted results, $S_P$ and $S_T$ are elements of the two sets, respectively, and d denotes the Euclidean distance.
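Both metrics are easy to compute when each mask is represented as a set of foreground pixel coordinates. The sketch below is a brute-force illustration in plain Python, not an optimized evaluation script; the function names are hypothetical, returning 1.0 for two empty masks is an assumed convention, and the Hausdorff function expects non-empty sets.

```python
import math

def dice_score(t, p):
    """Dice similarity coefficient between two sets of foreground pixels."""
    if not t and not p:
        return 1.0  # assumed convention when both masks are empty
    return 2 * len(t & p) / (len(t) + len(p))

def hausdorff_distance(t, p):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(a, b):
        # Largest distance from a point in a to its nearest point in b.
        return max(min(math.dist(x, y) for y in b) for x in a)
    return max(directed(t, p), directed(p, t))

label = {(0, 0), (0, 1), (1, 0), (1, 1)}
pred = {(0, 1), (1, 1), (0, 2), (1, 2)}
print(dice_score(label, pred))          # 0.5
print(hausdorff_distance(label, pred))  # 1.0
```

In this example the prediction is the label shifted by one pixel: half of the pixels overlap, and no pixel of either mask is farther than one pixel from the other mask.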

Experiment results
To verify the effectiveness of the proposed NVTrans-UNet in multi-modal cardiac image segmentation, we compared it with other advanced methods. These methods include the baseline MFU-Net, 26 FCDensenet, 27 FADLS, 28 U-Net, 29 PyMIC, 30 MVMM, 14 and CMRadjustNet. 31 The results show that the Dice score of MI was 0.642 ± 0.171, and the Dice score of MI + ME was 0.574 ± 0.110. Table 1 compares the segmentation results of the different methods. Compared with FCDensenet, the Dice score of our model increased by 3.4% and 6.3% in MI + ME and MI, respectively. Compared with CMRadjustNet, the Dice score of our model increased by 3.2% and 1.4% in MI + ME and MI, respectively. Table 2 demonstrates that the segmentation results of our method in the cardiac pathological regions are significantly better than those of other methods. The MGF module in NVTrans-UNet can take full advantage of the complementarity among multi-modal information. The NVT module can extract the local information of the cardiac image. The ASPP can obtain more effective receptive fields and reduce the parameter size of the model. In addition, the boundary loss can enhance the boundary of the cardiac image and give higher weights to the fuzzy boundary. All these show that the proposed NVTrans-UNet has better segmentation performance.

FIGURE 8 Visualization comparison of pathological region segmentation effects of multi-modal cardiac images under different ablation studies. The segmented regions include MI and ME. ASPP represents Atrous Spatial Pyramid Pooling, MGF represents multi-modal gated fusion, ML represents the mixed loss function, NVT represents Neighborhood Vision Transformer, and GT represents ground truth.

Ablation study
To evaluate the impact of each component, we conducted detailed ablation experiments to quantify the individual performance of the different modules we introduced. Tables 2 and 3 show that the proposed model has better performance in segmenting the small-size regions MI and ME. Figures 10 and 11 demonstrate the evaluation results of different ablation experiments on multi-modal cardiac images. As can be seen from the figures, the NVTrans-UNet presents better results, which shows the effectiveness of the NVT, MGF, and ASPP modules. It is worth noting that the gold standard label provided includes five parts, while we mainly consider myocardial infarction and edema.

DISCUSSION
In this paper, a multi-modal cardiac image segmentation method, NVTrans-UNet, is proposed. First of all, in the encoding phase, we introduce the NVT module. NVT is a hierarchical Transformer consisting of multiple neighborhood attention layers. The neighborhood attention adaptively localizes the receptive field to the neighborhood around each token without the need for extra operations, and it introduces local inductive bias while reducing the computational cost. Moreover, we introduce the MGF module to the different layers of the encoder. The module aggregates the feature maps of task-related information and automatically learns to adjust the contribution of the three modalities. Then, we introduce ASPP into the bottleneck layer. ASPP uses parallel atrous convolutions with different sampling rates, which can expand the receptive field and improve the segmentation of small targets. Because the pathological regions of the cardiac image are usually relatively small and the boundary is fuzzy, traditional segmentation losses do not perform well. We propose a mixed loss function that combines MS-SSIM loss, Tversky loss, and Focal loss for segmentation at the pixel level and patch level, which can capture fine structures with clear boundaries and reduce the misclassification of the target region.
Although NVTrans-UNet, with its redesigned network structure and loss function, shows better performance on the multi-modal cardiac image dataset, our model deals with 2D images and cannot fully utilize the 3D information of the data. We will make full use of the 3D information of the image in the future. Because nearby normal tissues affect the segmentation precision of pathological regions, we will further explore the relationship between myocardial pathology and healthy tissues to enhance the segmentation precision of lesion regions. In addition, we will try to extend the method to other small target segmentation tasks.

CONCLUSION
Automatic segmentation of cardiac images can help doctors diagnose cardiovascular diseases promptly, which has important clinical value. Therefore, we introduce NVTrans-UNet, a multi-modal cardiac segmentation model based on deep learning, to segment the small target regions MI and ME, which provides a reliable diagnostic basis for physicians to make accurate judgments. With the advantages of the NVT and MGF modules, NVTrans-UNet can compensate for missing local information and fuse multi-modal features dynamically, which improves the accuracy of segmentation. On the MyoPS 2020 dataset, we obtained competitive results: the Dice score of MI was 0.642 ± 0.171 and the Dice score of MI + ME was 0.574 ± 0.110. The results of the experiments demonstrate the potential of our NVTrans-UNet.

AUTHOR CONTRIBUTIONS
Bingjie Li devised the project, performed the experiments, and drafted the manuscript. Tiejun Yang provided critical revision of the manuscript for important intellectual content, as well as technical and material support. Xiang Zhao contributed to the design of this study and the revision of the manuscript. All authors reviewed the results and approved the final version of the manuscript.